feat(workflows): add continue_on_error step field for non-halting failures#2663
Open
doquanghuy wants to merge 1 commit into
Open
feat(workflows): add continue_on_error step field for non-halting failures#2663doquanghuy wants to merge 1 commit into
doquanghuy wants to merge 1 commit into
Conversation
e00e687 to
f34ab4c
Compare
Closes github#2591. Adds an optional `continue_on_error: bool` field on every step. When set to `true` and the step fails, the engine records the result (exit_code, stderr, status) into `steps.<id>.output` and continues to the next sibling step instead of halting the run. Downstream `if`, `switch`, or `gate` steps can then branch on `{{ steps.<id>.output.exit_code }}` to route the recovery path. This composes with primitives that already exist (the exit code is already captured, the expression engine already resolves it, and `if`/`switch`/`gate` are already available) — the only gap was that a non-zero exit hard-stopped the pipeline before any downstream step could evaluate it. ### Engine `WorkflowEngine._execute_steps` now consults the step config when a step returns `StepStatus.FAILED`: - Gate aborts (`output.aborted`) always halt the run — operator decisions take precedence over the flag. - Otherwise, if `continue_on_error: true`, log a `step_continue_on_error` event and proceed to the next sibling. - Otherwise, behave as before: set `RunStatus.FAILED` and return. ### Validation `_validate_steps` rejects non-bool values for `continue_on_error`. Coerced strings like `"true"` are not accepted so authoring mistakes surface at validation time rather than silently changing run semantics. ### Default behaviour preserved When `continue_on_error` is omitted, every code path is byte-equivalent to before this change. Existing workflows see no difference. ### Tests New `TestContinueOnError` class in `tests/test_workflows.py` covers all four scenarios from the issue's acceptance criteria plus two extras: - undeclared (default) failure halts the run. - declared-and-fired continues past the failure. - declared-but-step-succeeded is a no-op (flag only matters on FAILED). - if-branch end-to-end exercising the canonical recovery pattern from the issue discussion. - gate abort still halts even with `continue_on_error: true` set. - validation rejects non-bool values; accepts both `true` and `false` cleanly. ### Docs Adds an "Error Handling" section to `workflows/README.md` documenting the field, the gate-abort precedence rule, and the canonical recovery pattern. ### Follow-on Auto-retry-on-transient (e.g. retry a 429 at 3 AM without operator attendance) is intentionally out of scope. The current proposal covers the **skip** and **abort** verdicts from the original discussion; the **retry** verdict still pauses for an operator at the gate step. A future loop/retry-count primitive or an auto-approving gate could close that gap on top of this mechanism without further engine changes.
f34ab4c to
da8ed4d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Closes #2591.
Adds an optional
continue_on_error: boolfield on every step.When set to
trueand the step fails, the engine records theresult (exit_code, stderr, status) into
steps.<id>.outputandcontinues to the next sibling step instead of halting the run.
Downstream
if,switch, orgatesteps can then branch on{{ steps.<id>.output.exit_code }}to route the recovery path.This is the shape @mnriem proposed in the issue discussion —
it composes with primitives that already exist (the exit code
is already captured, the expression engine already resolves it,
and
if/switch/gateare already available). The only gapwas that a non-zero exit hard-stopped the pipeline before any
downstream step could evaluate it.
Canonical usage
Engine
WorkflowEngine._execute_stepsnow consults the step config whena step returns
StepStatus.FAILED:output.aborted) always halt the run — operatordecisions take precedence over the flag.
continue_on_error: true, log astep_continue_on_errorevent and proceed to the next sibling.step_failed, setRunStatus.FAILED, and return.Exactly one event per failure-resolution path is logged so the
log timeline is unambiguous: either the run continued past the
failure or it halted.
Validation
_validate_stepsrejects non-bool values forcontinue_on_error.Coerced strings like
"true"are not accepted so authoringmistakes surface at validation time rather than silently
changing run semantics.
Default behaviour preserved
When
continue_on_erroris omitted, every code path isbyte-equivalent to before this change. Existing workflows see no
difference.
Verdict coverage (from the issue discussion)
continue_on_error: true+ifbranches around the failurecontinue_on_error: true+gate→ operator approves →resumere-runs from gateFully unattended retry-on-transient (e.g. retry a 429 at 3 AM
without operator attendance) is intentionally out of scope here.
The skip and abort verdicts work without a human; the
retry verdict still pauses for one at the gate. A future
loop/retry-count primitive or an auto-approving gate type could
close that gap on top of this mechanism without further engine
changes — happy to follow up on that in a separate issue if
useful.
Testing
uv run specify --helpuv sync && uv run pytest→ 2967 passed, 35 skipped (was 2960 before; +7 new
tests added in this PR).
the middle step exits non-zero. Without
continue_on_error, run halts at the failing step (asbefore). With
continue_on_error: true, the failing steprecords
exit_codeand the third step executes. Adownstream
ifbranching on{{ steps.flaky.output.exit_code != 0 }}routes into arecovery
gatecleanly.New test coverage
TestContinueOnErrorintests/test_workflows.py:test_undeclared_failure_halts_runtest_declared_and_fired_continues_runtest_declared_but_step_succeeded_is_nooptest_if_branch_routes_around_failuretest_gate_abort_still_halts_with_continue_on_errortest_validation_rejects_non_bool_continue_on_error"true"(string) fails validation.test_validation_accepts_bool_continue_on_errortrueandfalsepass validation cleanly.AI Disclosure
Used Claude Opus to draft the engine change, the test suite, the
docs section, and this PR body. The shape (
continue_on_errorproposed by @mnriem on the issue thread; this PR implements that
proposal. Code, tests, and design decisions were human-reviewed
before submission.